Query-Informed Multi-Agent Motion Prediction

In a dynamic environment, autonomous driving vehicles require accurate decision-making and trajectory planning. To achieve this, autonomous vehicles need to understand their surrounding environment and predict the behavior and future trajectories of other traffic participants. In recent years, vectorization methods have dominated the field of motion prediction due to their ability to capture complex interactions in traffic scenes. However, existing research using vectorization methods for scene encoding often overlooks important physical information about vehicles, such as speed and heading angle, relying solely on displacement to represent the physical attributes of agents. This approach is insufficient for accurate trajectory prediction models. Additionally, agents’ future trajectories can be diverse, such as proceeding straight or making left or right turns at intersections. Therefore, the output of trajectory prediction models should be multimodal to account for these variations. Existing research has used multiple regression heads to output future trajectories and confidence, but the results have been suboptimal. To address these issues, we propose QINET, a method for accurate multimodal trajectory prediction for all agents in a scene. In the scene encoding part, we enhance the feature attributes of agent vehicles to better represent the physical information of agents in the scene. Our scene representation also possesses rotational and spatial invariance. In the decoder part, we use cross-attention and induce the generation of multimodal future trajectories by employing a self-learned query matrix. Experimental results demonstrate that QINET achieves state-of-the-art performance on the Argoverse motion prediction benchmark and is capable of fast multimodal trajectory prediction for multiple agents.


Introduction
Multi-object motion prediction is an essential step in autonomous driving. It helps autonomous vehicles make informed decisions and avoid accidents. However, traffic scenes are highly complex, involving targets, road networks, and their interactions. Prediction models need to take these entities as inputs and output reasonable multimodal trajectories that intelligent agents may take in the future.
Recently, deep learning-based methods have shown promising results in motion prediction tasks [1][2][3][4]. Some studies rasterize scenes into top-down views and employ CNNs for prediction [1,5,6]. While these methods are easy to implement with off-the-shelf image models, they have limited applicability and come with a high cost. Given these constraints, recent research [2,4] has adopted a vectorization approach for a more efficient scene representation, extracting a set of vector nodes from the trajectories of traffic participants and map elements. Subsequently, to learn relationships between vectorized entities, such as trajectory waypoints and lane segments, some studies [7][8][9] use graph neural networks, others [10] use transformers, and still others [11] use point cloud models to process scenes. In addition, some research [12][13][14] has focused on the vulnerability of data-driven algorithms, Autoencoder, and GAN for trajectory prediction.
However, during the scene encoding process, the perspective of predicting the scene varies for each target. Most existing methods transform the entire scene with respect to one target agent at a time, which leads to asymmetry for other agents. This approach has lower prediction efficiency and is less robust to co-ordinate transformations. To address these issues, HiVT [15] uses rotation-invariant and shift-invariant scene representations, which greatly enhances the robustness of the prediction model. Additionally, many studies simply use displacement to encode vectors without considering other physical information of scene entities, such as speed, heading angle, and spatial relationships between entities. This oversight may lead to suboptimal trajectory predictions. In the decoder section, the predicted trajectories should be multimodal. However, current research [16] mostly employs non-maximum suppression (NMS) methods, which set a threshold and filter the multimodal trajectories based on L2 distances between future trajectory endpoints, yielding unsatisfactory results.
Following HiVT, we employ a symmetric scene representation in which all relative features possess translational and rotational invariance. Building upon this, we introduce QINET, a query-based framework for inducing the generation of multimodal trajectories. In terms of scene encoding, our representation method incorporates additional relevant features about the target and lane segments to convey their physical significance. In the trajectory prediction decoder section, to enhance the multimodality of output trajectories, we introduce self-learnable parameter matrices for cross-attention over the agent vehicle's history, which we refer to as the query mechanism.
QINET consists of an encoder module and a decoder module. In the encoder section, we expand the scene representation of vectorized entities, incorporating new features related to targets and lane segments. We use projected vector representations for all relevant features of the traffic scene at each timestamp, providing a more detailed description of relative spatial relationships. In the decoder section, we develop a query-informed architecture that combines features from agent history, agent relationships, and the road graph as a target-centric scene representation. This is achieved through querying and self-attention learning to extract and combine scene features. The combined feature representation is used to induce the output of multimodal trajectories. Part of this feature representation comes from the agent's historical trajectory features, while another part comes from the output of the encoder section, which we refer to as anchor points. In this way, the network first learns to generate scene modal features with maximum diversity without environmental constraints, and then decodes future trajectories after absorbing anchor points carrying rich target-oriented contextual information.
Our contributions can be summarized as follows: first, we extend HiVT's scene representation method by incorporating new features related to targets and lane segments, providing a more detailed description of relative spatial relationships; second, we propose a query-informed decoder based on DETR [17], which combines historical query information and anchor point information end-to-end, promoting the multimodality of output trajectories; and, third, the designed QINET is capable of making accurate and reasonable predictions.

Related Work
Traffic Scene Representation. Solving the motion prediction problem requires learning rich representations from elements in the traffic scene, including high-definition maps and the agent's past trajectories. A significant amount of research employs grid-based scene representations as model inputs [18][19][20] and utilizes standard image models [21,22] for learning. Specifically, these methods extract map elements (such as lane boundaries, stop lines, and crosswalks) from high-definition maps and render the scene into a top-down image using different colors. The agent's past trajectories are either rasterized into additional image channels [1,5] or processed by temporal models like RNNs [23,24].
Rasterization methods rely on mature computer technology, but they are inefficient and costly. Recently, vectorization methods [2,4,25] have gained popularity due to their efficient sparse encoding and ability to capture complex structural information. Unlike rasterization methods, these approaches consider the scene as a set of entities associated with semantic and geometric attributes, and learn the relationships between these entities. VectorNet [25] employs graph neural networks to model interactions between lanes and trajectory polylines. It has also served as a backbone in some subsequent works [26,27]. LaneGCN [2] constructs lane graphs from lane segments and employs multi-scale graph convolutional networks to capture long-range dependencies by learning features of graph nodes. TPCN [4] extends point cloud models to learn from a spatial-temporal set composed of trajectory waypoints and lane points. Our scene representation falls under this category as well, but what sets it apart is that all vectorized entities are characterized by relative positions, enhancing model robustness as relative displacements remain invariant to translation. The approach most related to this paper is HiVT [15], which uses a translation-invariant scene representation that avoids absolute positions and characterizes geometric entities with relative positions, and which constructs rotation-invariant transformers to model the different interactions between the vectorized entities locally and globally.
Motion Prediction. Since social interactions are ubiquitous in traffic scenes and significantly influence the future motions of traffic agents, many motion prediction methods consider dependencies between agent behaviors and rational agent-agent interactions. They employ social pooling [28,29], graph neural networks [30,31], or attention mechanisms [20,25,32,33,34]. Inspired by the success of transformer models [10] in various fields, some recent works utilize transformers in motion prediction tasks to model spatial relationships, temporal dependencies, and relationships between agents and map elements [23,32,34,35].
In contrast, our transformer architecture differs from existing architectures by incorporating hierarchical learning of local and global representations. We encode each timestep separately in the local encoder, which factorizes the temporal dimension. In addition, we decompose the space by modeling multiple agents with a goal-centered representation that is invariant to translation and rotation in the scene. In the global encoder, we model interactions among all vehicles in the scene to obtain long-range dependency information. The combination of a hierarchical structure and symmetric design allows our approach to achieve state-of-the-art prediction performance with fewer parameters and lower computational costs.

Overall Framework
This section proposes a query-informed vehicle trajectory prediction method, QINET, which induces the model to generate multimodal predicted trajectories to the maximum extent. The overall process is shown in Figure 1. The prediction model consists of three parts: scene vectorization representation, encoder, and decoder. First, in the scene vectorization representation part, we extend the feature representation in HiVT [15]. For the vector node representation of an agent, we use the displacement of the agent and its absolute value, the velocity and its absolute value, the sine and cosine of the heading angle, and the timestamp length. For the lane vector node representation, we use the lane segment displacement and heading angle. Such a scene representation provides a more detailed description of the relative spatial relationships between vector nodes. Then, our feature extraction encoder makes use of HiVT's encoder architecture to design subgraphs for local encoding. A transformer encoder attends to temporal information, and a global graph extracts and exchanges long-range dependency information. In addition, we propose a query-informed multi-scene modality for an end-to-end learning approach to induce the output of multimodal predicted trajectories. In our method, the proposals are obtained by querying the encoding output by the transformer encoder via cross-attention, which extracts information from the agent's historical trajectory. In this way, the multimodality of the predicted trajectory can be promoted to the maximum extent without environmental constraints. In addition, we obtain the anchor feature representation, which absorbs and carries rich goal-oriented context information, through self-attention learning on the global graph output. A portion of the input to the multimodal trajectory prediction decoder comes from the query-informed multi-scene modality features, while another portion comes from anchor features.
Sensors 2024, 24, x FOR PEER REVIEW 4 of 16
Figure 1. This is the overall framework of QINET. We utilize the graph attention mechanism (GAT) to establish A2A and L2A for extracting environmental features around participants. We set query matrices (queries) to query the historical trajectory features of the agent, obtaining diverse scene-modal features. These scene-modal features, combined with anchor features learned from the global graph output, are used to output multimodal trajectories.

Complexity Analysis
In the local encoder part, we utilized the model architecture of HiVT to decompose time and spatial dimensions, learning spatial relationships locally at each timestep. This approach reduces complexity from $O((NT + L)^2)$ to $O(NT^2 + TN^2 + NL)$, where $N$, $T$, and $L$ represent the number of agents, historical timesteps, and the number of lane segments, respectively. Although we expanded the feature dimension of nodes in HiVT, this expansion has minimal impact on the overall complexity. In addition, in the decoder part, the complexity added by the method of designing query matrices to obtain multimodal scene feature representations can be calculated as $O(NT)$, which is lightweight compared to the local encoder.
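To give a feel for the size of this reduction, the two expressions can be evaluated for a concrete scene. The sketch below is only an illustration: the scene size (30 agents, 20 timesteps, 200 lane segments) is a hypothetical choice, the function names are not from the paper, and the reading of the three terms as temporal, agent-agent, and agent-lane attention is our interpretation of the factorization.

```python
def joint_attention_cost(n_agents: int, n_steps: int, n_lanes: int) -> int:
    """Pairwise cost when every agent-timestep token and lane token attend jointly."""
    return (n_agents * n_steps + n_lanes) ** 2


def factorized_attention_cost(n_agents: int, n_steps: int, n_lanes: int) -> int:
    """Cost after factorization: N*T^2 + T*N^2 + N*L.

    Interpreting the terms as temporal attention per agent, agent-agent
    attention per timestep, and agent-lane attention is an assumption.
    """
    return (n_agents * n_steps ** 2
            + n_steps * n_agents ** 2
            + n_agents * n_lanes)


# Hypothetical scene size, chosen only for illustration.
N, T, L = 30, 20, 200
joint = joint_attention_cost(N, T, L)
factorized = factorized_attention_cost(N, T, L)
print(joint, factorized, round(joint / factorized, 1))
```

For this particular scene size the factorized count is more than an order of magnitude smaller than the joint one; the gap widens as the number of lane segments grows, since $L$ enters quadratically only in the joint form.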

Node Feature Representation
For the traffic agent, we extract trajectory segments at each timestamp, which take the form of directed splines. These trajectory segments, referred to as vector nodes, are characterized by their feature attributes as:

$$\{ (d_i^t, v_i^t, a_i^t, \Delta t^t, b_i) \mid i = 1, \dots, N_t,\ t = 0, \dots, 19 \}$$

where $N_t$ represents the total number of agent vehicles appearing at timestamp $t$; $l_i^t$ denotes the co-ordinates of the $i$-th agent in the scene at timestamp $t$; $d_i^t$ is the displacement vector of agent vehicle $i$ from timestamp $t-1$ to $t$; $v_i^t$ represents the speed; $\alpha_i^t$ indicates the heading angle of the $i$-th agent at timestamp $t$; $a_i^t$ is the heading vector composed of the cosine and sine of the agent's heading angle; and $\Delta t^t$ represents the duration of the timestamp. The duration is included in the node features because the sampling frequency of the Argoverse dataset is non-uniform and not consistently 0.1 s. $R_i^T$ is the rotation matrix defined by the heading angle of the $i$-th agent at the current timestep ($t = 19$), and $b_i$ represents the semantic feature.
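The following minimal sketch shows how node features of this kind could be assembled for one agent. It is an illustration under stated assumptions rather than the implementation used here: displacements are rotated into the frame of the agent's current heading, velocity is derived from displacement divided by the step duration, and the heading vector is taken relative to the current heading. The function names and dictionary keys are hypothetical.

```python
import math


def rotate_into_frame(vec, heading):
    """Return R(heading)^T applied to vec, i.e. vec expressed in the rotated frame."""
    c, s = math.cos(heading), math.sin(heading)
    x, y = vec
    return (c * x + s * y, -s * x + c * y)


def agent_node_features(positions, headings, timestamps):
    """Build per-step node features from the observed history of one agent."""
    ref_heading = headings[-1]  # heading at the current (last observed) timestep
    features = []
    for t in range(1, len(positions)):
        dt = timestamps[t] - timestamps[t - 1]  # non-uniform step duration
        step = (positions[t][0] - positions[t - 1][0],
                positions[t][1] - positions[t - 1][1])
        d = rotate_into_frame(step, ref_heading)  # displacement, no absolute position
        v = (d[0] / dt, d[1] / dt)                # assumed: velocity = displacement / dt
        rel = headings[t] - ref_heading
        features.append({
            "d": d, "d_norm": math.hypot(d[0], d[1]),
            "v": v, "speed": math.hypot(v[0], v[1]),
            "a": (math.cos(rel), math.sin(rel)),
            "dt": dt,
        })
    return features


# An agent heading north with slightly irregular sampling.
history = [(10.0, 5.0), (10.0, 6.0), (10.0, 7.5)]
stamps = [0.0, 0.1, 0.22]
north = [math.pi / 2] * 3
for row in agent_node_features(history, north, stamps):
    print(row)
```

Because only differences of positions enter the features, shifting the whole history by a constant offset leaves every feature unchanged.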
For lane vector nodes, we opt to extract the co-ordinates of lane points along with their associated semantic attributes, such as dashed or solid lines, and turning directions. We vectorize lane segments into nodes similar to agent vectors, and represent them as

$$\{ (d_k, a_k, b_k) \mid k = 1, \dots, N_l \}$$

where $N_l$ denotes the total number of lane segments, $d_k$ represents the displacement vector of the lane segment, $a_k$ is composed of the sine and cosine values of the heading angle of the lane segment, indicating the direction of the displacement vector, and $b_k$ denotes the semantic attribute. Both $d_k$ and $a_k$ are computed from $l_k^1$ and $l_k^0$, which represent the endpoint and starting point of the lane segment, respectively. In the vectorized node representation, we abstain from using any absolute positions and instead utilize relative positions. This ensures that the node feature attributes possess translational invariance.
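As a small illustration of the lane vectorization and its translational invariance, the sketch below turns a centerline into lane nodes and compares the result with that of a shifted copy. The function name, the dictionary keys, and the form of the semantic attribute are assumptions made for this example.

```python
import math


def vectorize_lane(centerline, semantic):
    """Turn consecutive centerline points into lane vector nodes."""
    nodes = []
    for start, end in zip(centerline, centerline[1:]):
        d = (end[0] - start[0], end[1] - start[1])  # endpoint minus starting point
        heading = math.atan2(d[1], d[0])            # direction of the displacement
        nodes.append({
            "d": d,
            "a": (math.sin(heading), math.cos(heading)),
            "b": semantic,
        })
    return nodes


centerline = [(0.0, 0.0), (2.0, 0.0), (4.0, 2.0)]
attrs = {"turn": "left", "marking": "dashed"}  # hypothetical semantic attributes
nodes = vectorize_lane(centerline, attrs)
shifted = vectorize_lane([(x + 100.0, y - 40.0) for x, y in centerline], attrs)
print(nodes == shifted)
```

The comparison holds because no absolute co-ordinate survives in the node attributes; only the segment displacement and its direction do.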

Edge Feature Representation
Node features only represent the characteristics of agent vehicles and lane segments. The graph attention mechanism requires specifying attention targets in the scene, and encoding the features of edges between the target agent vehicle and the attention targets. Therefore, we introduce attributes for the edges between entities. The edge attributes between agents are defined for every pair of distinct agents $i$ and $j$ ($i \neq j$) at every timestamp up to $t = 19$, in terms of the following quantities: $d_{ij}^t$ is the relative displacement vector between agent $i$ and agent $j$; $v_{ij}^t$ is the velocity vector between them; $R_i^t$ is the rotation matrix parameterized by the heading angle of center agent $i$ at timestamp $t$; $a_{j2i}^t$ represents the relative heading angle vector; $d_{j2i}^t$ expresses the lateral and longitudinal distance of agent $j$ relative to agent $i$ at timestamp $t$; and $v_{j2i}^t$ represents the lateral and longitudinal velocity of agent $j$ relative to agent $i$ at timestamp $t$. Here, $x$ denotes the lateral direction and $y$ the longitudinal direction relative to the center node agent $i$.
The edge attributes between agents and lane segments are described analogously: $d_{ik}^t$ is the relative position vector between agent $i$ and lane segment $k$ at timestamp $t$; $R_k^t$ is the rotation matrix parameterized by the heading angle $\alpha_k^t$ of lane node $k$; $d_{i2k}^t$ represents the lateral and longitudinal distance from agent $i$ to lane segment $k$, where $x$ is lateral and $y$ is longitudinal relative to lane segment $k$; $v_{i2k}^t$ represents the relative velocity vector, whose components $v_{i2k,x}^t$ and $v_{i2k,y}^t$ denote the lateral and longitudinal velocity of agent $i$ relative to lane segment $k$, respectively; and $a_{i2k}^t$ is the relative heading angle vector.
The relative positions and velocities we propose describe the distance between two independent nodes, as well as the speed at which one node moves laterally and longitudinally towards another node. Compared to absolute representations, this relative representation provides a more detailed description of the interactions between entities, allowing downstream networks to better understand their behaviors. Additionally, this lateral and longitudinal relative representation naturally ensures translational and rotational invariance.
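The invariance claim can be checked numerically. The sketch below builds agent-agent edge attributes in the frame of the center agent and verifies that they do not change when the whole scene is rotated and shifted. It is a simplified illustration: taking the velocity vector between two agents as the difference of their velocities, the sign convention for the lateral axis, and all function names are assumptions.

```python
import math


def to_lateral_longitudinal(vec, heading):
    """Project a global 2D vector onto the (lateral, longitudinal) axes of a heading.

    Longitudinal is along the heading; lateral is taken as positive to the left.
    """
    c, s = math.cos(heading), math.sin(heading)
    longitudinal = c * vec[0] + s * vec[1]
    lateral = -s * vec[0] + c * vec[1]
    return (lateral, longitudinal)


def a2a_edge(agent_i, agent_j):
    """Edge attributes of neighbor j as seen from center agent i."""
    d_ij = (agent_j["pos"][0] - agent_i["pos"][0], agent_j["pos"][1] - agent_i["pos"][1])
    v_ij = (agent_j["vel"][0] - agent_i["vel"][0], agent_j["vel"][1] - agent_i["vel"][1])
    rel = agent_j["heading"] - agent_i["heading"]
    d_j2i = to_lateral_longitudinal(d_ij, agent_i["heading"])
    v_j2i = to_lateral_longitudinal(v_ij, agent_i["heading"])
    a_j2i = (math.cos(rel), math.sin(rel))
    return d_j2i + v_j2i + a_j2i


def move_scene(agent, angle, shift):
    """Rotate an agent state about the origin by angle, then translate by shift."""
    c, s = math.cos(angle), math.sin(angle)

    def rot(p):
        return (c * p[0] - s * p[1], s * p[0] + c * p[1])

    pos = rot(agent["pos"])
    return {"pos": (pos[0] + shift[0], pos[1] + shift[1]),
            "vel": rot(agent["vel"]),
            "heading": agent["heading"] + angle}


ego = {"pos": (0.0, 0.0), "vel": (8.0, 0.0), "heading": 0.0}
other = {"pos": (5.0, 2.0), "vel": (6.0, 1.0), "heading": 0.2}
before = a2a_edge(ego, other)
after = a2a_edge(move_scene(ego, 1.3, (50.0, -20.0)), move_scene(other, 1.3, (50.0, -20.0)))
print(max(abs(x - y) for x, y in zip(before, after)))
```

The printed difference is at the level of floating-point round-off, which is what translational and rotational invariance predicts.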

Local Encoder
The local encoder processes the temporal scene graph in two stages, as shown in Figure 2. In the first stage, it models the agent-agent interactions for each timestep, which we refer to as A2A. For A2A, we perform local interactions centered around each agent within a limited range. After the interaction between entities, we use a temporal transformer encoder module to capture temporal dependencies across the traffic scene. In the second stage, we extract the features from the last timestep of the output of the transformer encoder. These features contain information about the central agent vehicle at the current timestep, as well as interaction information with other nearby agent vehicles in both spatial and temporal dimensions. We use these features for modeling lane-agent interactions, which we refer to as L2A. After the local encoder is completed, we obtain agent features that are enriched with contextual information.
Agent-Agent Interaction. During the A2A step, we utilize a graph neural network. We employ multi-head cross-attention to understand the influence of different surrounding agent vehicles within each local range on the central agent vehicle. Specifically, we first apply multi-layer perceptrons (MLPs) to the node attributes of the central agent vehicle and the corresponding edge attributes. This allows us to obtain a time-variant encoding $Z_i = \{z_i^t \mid t = 1, \dots, T\}$ for the central agent node $i$, along with time-variant encodings $z_{ij}^t$ for the surrounding neighboring nodes associated with it, where $\phi_{center}$ and $\phi_{nbr}$ represent the MLP modules that produce these encodings. Due to the use of relative vectors and the presence of rotation matrices, both the node attributes of the central node and its associated edge attributes possess translational and rotational invariance. Next, we use cross-attention to fuse the central node features and its edge features. The query part of the cross-attention is derived from the central node attribute $z_i^t$, while the key and value parts come from the edge attributes $z_{ij}^t$. Subsequently, we perform dot product [10] and gating operations [15], resulting in the output $\hat{Z}_i = \{\hat{z}_i^t \mid t = 1, \dots, T\}$. We then further apply an MLP to $\hat{Z}_i$ and use residual connections to obtain the merged feature encoding $S_i = \{s_i^t \mid t = 1, \dots, T\}$, which contains information about agent interactions and updates after the interaction.
Temporal Dependency. To further capture temporal dependencies, we apply a transformer encoder to the output $S_i$ of the A2A step. Following the approach of BERT [36], we introduce learnable position embeddings at each timestamp and stack them onto $S_i$ to obtain the new matrix $\hat{S}_i \in \mathbb{R}^{T \times d_h}$. Previous studies [15] append an extra learnable token at the end position, resulting in $\hat{S}_i \in \mathbb{R}^{(T+1) \times d_h}$; we do not add such a token. Instead, we directly process $\hat{S}_i$ through the transformer encoder to obtain the updated sequence features $H_i = \{h_i^t \mid t = 1, \dots, T\}$ and extract the final node feature $h_i^T$ belonging to the current timestep. This feature is then fed into the subsequent L2A module, as we have observed improved performance with this approach. During the transformer encoding process, a time mask is applied to enforce tokens to only attend to preceding timesteps.
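The effect of the time mask can be sketched in a few lines. The example below applies a causal mask to a matrix of attention scores before the softmax, so that the weight on any later timestep is exactly zero. It assumes that each token may also attend to itself (otherwise the first timestep would have nothing to attend to); the function names and the toy scores are hypothetical.

```python
import math


def causal_mask(num_steps):
    """mask[t][s] is True when query step t may attend to key step s (s <= t)."""
    return [[s <= t for s in range(num_steps)] for t in range(num_steps)]


def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        peak = max(x for x, ok in zip(row, allowed) if ok)  # for numerical stability
        exps = [math.exp(x - peak) if ok else 0.0 for x, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


T = 4
scores = [[0.1 * (t + s) for s in range(T)] for t in range(T)]  # toy attention scores
weights = masked_softmax(scores, causal_mask(T))
for row in weights:
    print([round(w, 3) for w in row])
```

With this mask, the feature at the last timestep summarizes the whole observed history, which is consistent with using $h_i^T$ as the input to the L2A module.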
Agent-Lane Interaction. To facilitate information interaction between agents and lane segments, we apply another multi-head cross-attention module. First, we use a multi-layer perceptron $\phi_{lane}$ to encode the edge features between the central agent node $i$ and nearby lane nodes, obtaining the edge encodings $z_{ik}$. We use the current timestep's agent node feature $h_i^T$ from the transformer encoder output as the query, and the edge attributes $z_{ik}$ between the agent and the lane segment as the key and value. The field of view is an adjustable threshold used to limit the lane nodes that need to be interactively fused with the central agent node. We obtain the final node embedding $h_i$ for central agent $i$. It encapsulates a rich spatiotemporal representation that combines the dynamic characteristics of agent $i$ with its iterative interactions with the surrounding environment. The final local representation for all agents is defined as $H = \{h_i \mid i = 1, \dots, N\}$, where $N$ is the number of agents.

Global Encoder
The local encoder only achieves information interaction within a local scope, lacking long-range dependency relationships within the scene. Therefore, we designed a global encoder. Similar to the A2A module in the local encoder, we employ an MLP to encode the edge attributes between agent $i$ and agent $j$ at timestep $T$, where $T$ represents the current timestep, obtaining the edge encoding $g_{ij}$. Afterwards, we use $h_i$ as the query and $h_j$, $g_{ij}$ as the key and value, projected by the linear transformation matrices $W^Q_{global}$, $W^K_{global}$, and $W^V_{global}$, respectively. Then, we apply a multi-head cross-attention module to update the features of agent $i$ by aggregating over $N_i$, the set of neighboring agents that central agent $i$ needs to interact with, where $\alpha_{ij}$ represents the score weight of neighbor agent $j$ relative to agent $i$. Furthermore, we pass the updated neighbor agent feature $\hat{h}_i$ and the central agent feature $h_i$ through a gating step [15] before inputting them into the MLP module to obtain the output of the global graph $h_i$, with feature dimensions denoted as $[K, N, D]$. Here, $K$ is the number of heads in the multi-head cross-attention module, representing the number of modes in the output trajectory, $N$ denotes the number of agents, and $D$ is the feature dimension.

Query-Informed Multi-Scene Modality Creation
As shown in Figure 3, we relinquish the constraints of specific driving scenarios and aim to maximize the diversity of future trajectory candidates by first creating multimodal scenes through querying from the motion history of the target agent. Specifically, inspired by the approach of setting object queries in DETR [17], we define a set of learnable parameters forming a query matrix $Q_{scene} \in \mathbb{R}^{K \times D}$ to attend to the output $H_i = \{h_i^t \mid t = 1, \dots, T\} \in \mathbb{R}^{T \times D}$ from the temporal dependency module. This is achieved by generating $K$ scene modal features $E_{scene} = \{E_{scene}^k \mid k = 1, \dots, K\} \in \mathbb{R}^{K \times D_E}$ through a cross-attention mechanism:

$$\alpha_{kt} = \frac{(Q_{scene}^k W_q)(h_i^t W_k)^{\top}}{\sqrt{D_E}}, \qquad score_k = \mathrm{softmax}(\alpha_{k1}, \dots, \alpha_{kT}), \qquad E_{scene}^k = \sum_{t=1}^{T} score_{k,t} \, (h_i^t W_v)$$

where $Q_{scene}^k$ is the $k$-th row of $Q_{scene}$, and $W_q$, $W_k$, and $W_v \in \mathbb{R}^{D \times D_E}$ are linear transformation matrices. $\alpha_{kt}$ represents the contribution of the agent's feature at the $t$-th historical timestamp to the $k$-th scene mode. We apply a softmax operation to the contributions of all timestamps corresponding to the $k$-th scene mode to obtain the vector $score_k$. The contribution scores corresponding to each timestamp in $score_k$ are multiplied with the feature of the agent at that timestamp and then summed, resulting in the queried $k$-th scene modal feature. Since the scene features are derived from the agent's historical trajectory encoding and do not contain semantic features of the surrounding environment, this ensures maximum multimodality of the scene. $D_E$ represents the dimension after linear transformation. The denominator $\sqrt{D_E}$ is used for normalization and to prevent the dot product from becoming too large, which might lead to saturation in the softmax operation.

We establish a learnable scene query matrix to query the historical trajectory features of agents, obtaining modality features for multiple scenes for decoding future trajectories.
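A compact way to see the query mechanism at work is to run the cross-attention on random inputs. The sketch below is a plain-Python illustration, not the trained model: the query matrix, the history encoding, and the projection matrices are filled with random numbers that merely stand in for learned parameters, the sizes ($K = 6$, $T = 20$, $D = 8$, $D_E = 4$) are chosen for illustration only, and all function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def query_scene_modes(q_scene, history, w_q, w_k, w_v):
    """Return K scene-mode features and their attention scores over T timesteps."""
    q = matmul(q_scene, w_q)   # K x D_E
    k = matmul(history, w_k)   # T x D_E
    v = matmul(history, w_v)   # T x D_E
    scale = math.sqrt(len(w_q[0]))  # sqrt(D_E)
    e_scene, scores = [], []
    for q_k in q:
        # Contribution of every historical timestep to this scene mode.
        alpha_k = [sum(a * b for a, b in zip(q_k, k_t)) / scale for k_t in k]
        score_k = softmax(alpha_k)
        # Weighted sum of the projected history features.
        mode = [sum(w * v_t[d] for w, v_t in zip(score_k, v)) for d in range(len(v[0]))]
        scores.append(score_k)
        e_scene.append(mode)
    return e_scene, scores


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
K, T, D, D_E = 6, 20, 8, 4            # illustrative sizes, not the paper's settings
q_scene = random_matrix(K, D, rng)    # stands in for the learnable query matrix
history = random_matrix(T, D, rng)    # stands in for the temporal encoder output
w_q, w_k, w_v = (random_matrix(D, D_E, rng) for _ in range(3))
e_scene, scores = query_scene_modes(q_scene, history, w_q, w_k, w_v)
print(len(e_scene), len(e_scene[0]), round(sum(scores[0]), 6))
```

Each query row yields its own distribution over the $T$ historical timesteps, so different queries can emphasize different parts of the history and thereby produce distinct scene modes.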

Anchor Learning
The anchors are learned end-to-end in the network to convey target-oriented environmental information while preserving diversity. We set the number of anchors to $K$, which is equal to the number of environmental modes, so that each anchor corresponds to one scene mode. We apply an MLP module $\phi_{scene}$ to the output of the global graph $h_i$ to generate anchor features corresponding to each scene mode. Our model does not directly utilize predicted endpoints, but rather leverages their embeddings (i.e., pre-output features) as anchor points $E_{anch} = \{E_{anch}^i \mid i = 1, \dots, K\}$. These anchor points are used to inject target-oriented scene context into the generated multimodal scene features $E_{scene}$. This differs from previous research. In TNT [16], anchor points are manually sampled uniformly from the map. In MultiPath [1], anchor points are predefined trajectories clustered from training data. In MultiPath++ [37], anchor points are learnable model parameters that are fixed after training and independent of the input. In contrast, we propose using anchor embeddings to facilitate trajectory learning. Compared to TNT and MultiPath, our anchors are more adaptive and convenient to obtain through end-to-end learning. Compared to MultiPath++, our anchors correspond to individual samples, thus carrying sample-specific information.

Trajectory Prediction Head
As described above, the multimodal scene encoding $E_{\text{scene}}$ can be seen as unconstrained future trajectories inferred solely from the agent's history, while the anchors $E_{\text{anch}}$ convey target-based contextual information. Here, we combine both, using the linear transformation matrices $W_1$ and $W_2$, to allow the network to make further selections and refinements. Taking the fused features as input, the multimodal prediction head outputs the final motion predictions. For each participant, it predicts $K$ possible future trajectories along with their confidences. The head has two branches: a regression branch that predicts the trajectory for each mode, and a classification branch that predicts the confidence score for each mode. For the $i$-th participant, the regression branch applies residual blocks and linear layers to regress $K$ sequences of relative coordinates, where $p^{k}_{i,t}$ denotes the predicted relative coordinates of the $i$-th participant in the $k$-th mode at timestep $t$, i.e., coordinates in the local coordinate system whose origin is the historical endpoint of the $i$-th participant. For the classification branch, we apply an MLP to $p^{k}_{i,T} - p_{i,0}$ to obtain $K$ distance embeddings, where $p_{i,0}$ is the last point of the historical trajectory of the $i$-th agent. We then concatenate each distance embedding with the agent features and apply residual blocks and linear layers to output $K$ confidence scores, $O_{i,\text{cls}} = (c_{i,0}, c_{i,1}, \ldots, c_{i,K-1})$.
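The data flow through the two branches can be sketched as follows. This is a simplified sketch under stated assumptions, not the exact architecture: the fusion is assumed to be additive, single linear layers stand in for the residual blocks, the softmax normalisation of the confidences is an assumption, and all weights and inputs are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D, T = 3, 6, 128, 30   # agents, modes, hidden size, future timesteps

e_scene = rng.normal(size=(N, K, D))   # multimodal scene features (placeholder)
e_anch = rng.normal(size=(N, K, D))    # anchor embeddings (placeholder)
agent_feat = rng.normal(size=(N, D))   # per-agent features (placeholder)


def linear(d_in, d_out):
    """Random weight matrix standing in for a learned linear layer."""
    return rng.normal(scale=0.05, size=(d_in, d_out))


# Assumed additive fusion with two linear maps W1 and W2.
w1, w2 = linear(D, D), linear(D, D)
fused = e_scene @ w1 + e_anch @ w2                       # (N, K, D)

# Regression branch: one linear layer stands in for residual blocks.
traj = (fused @ linear(D, T * 2)).reshape(N, K, T, 2)    # relative (x, y)

# Classification branch: embed the endpoint offset p_{i,T} - p_{i,0}.
# In the agent-centred frame p_{i,0} is the origin.
p0 = np.zeros((N, 1, 2))
dist_emb = (traj[:, :, -1, :] - p0) @ linear(2, D)       # (N, K, D)
cls_in = np.concatenate(
    [dist_emb, np.broadcast_to(agent_feat[:, None, :], (N, K, D))], axis=-1
)
logits = (cls_in @ linear(2 * D, 1)).squeeze(-1)         # (N, K)

# Softmax over modes (an assumption; the text only says "confidence scores").
conf = np.exp(logits - logits.max(axis=1, keepdims=True))
conf /= conf.sum(axis=1, keepdims=True)
print(traj.shape, conf.shape)
```

Each agent ends up with K trajectories of T two-dimensional points and K confidences that sum to one.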

Dataset
We utilize the Argoverse motion forecasting dataset [38], which comprises real-world traffic scenarios with agent trajectories and high-definition maps. The dataset encompasses 324,557 authentic traffic scenes. The training, validation, and test sets include 205,942, 39,472, and 78,143 scenes, respectively. Each scene is a 5 s sequence sampled at 10 Hz, containing the positions of all agents in the past 2 s. In the Argoverse motion forecasting challenge, the task is to predict the future positions of a target agent for the next 3 s based on an initial observation of the first 2 s of the scene.
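The timing above implies 20 observed steps and 30 steps to predict per scene. The following minimal sketch shows that arithmetic and the corresponding split; the function name and the toy trajectory are hypothetical.

```python
import numpy as np

SAMPLE_RATE_HZ = 10
HISTORY_S, FUTURE_S = 2, 3

history_steps = HISTORY_S * SAMPLE_RATE_HZ   # 20 observed steps
future_steps = FUTURE_S * SAMPLE_RATE_HZ     # 30 steps to predict


def split_sequence(positions):
    """Split a (50, 2) array of x/y positions into observed and future parts."""
    assert positions.shape[0] == history_steps + future_steps
    return positions[:history_steps], positions[history_steps:]


# Hypothetical agent moving 1 m per step along the x axis.
sequence = np.stack([np.arange(50.0), np.zeros(50)], axis=1)
observed, future = split_sequence(sequence)
print(observed.shape, future.shape)
```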

Metrics
We adhere to the Argoverse benchmark and evaluate our model using metrics including minimum average displacement error (minADE), minimum final displacement error (minFDE), and miss rate (MR). These metrics allow the model to predict up to 6 trajectories for each agent.
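For readers unfamiliar with these metrics, here is a minimal sketch of how they can be computed for one agent. The 2.0 m miss threshold follows the usual Argoverse convention and is not stated in the text; the function name is hypothetical, and the official evaluation code may select the best mode differently (for example, by endpoint error for both metrics).

```python
import numpy as np


def displacement_metrics(pred, gt, miss_threshold=2.0):
    """pred: (K, T, 2) candidate trajectories; gt: (T, 2) ground truth.

    Returns (minADE, minFDE, missed) for a single agent.
    """
    dist = np.linalg.norm(pred - gt[None], axis=-1)   # (K, T) pointwise errors
    min_ade = dist.mean(axis=1).min()                 # best average error
    min_fde = dist[:, -1].min()                       # best endpoint error
    missed = bool(min_fde > miss_threshold)           # no mode ends close enough
    return min_ade, min_fde, missed


# Toy example: the ground truth goes straight along x for 30 steps.
gt = np.stack([np.arange(1.0, 31.0), np.zeros(30)], axis=1)
pred = np.stack([gt + [0.0, 0.5],     # parallel track, 0.5 m to the side
                 gt + [0.0, 3.0]])    # parallel track, 3.0 m to the side
min_ade, min_fde, missed = displacement_metrics(pred, gt)
print(min_ade, min_fde, missed)
```

MR over a dataset is then the fraction of agents for which `missed` is true.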

Implementation Details
We trained all models using the AdamW optimizer [39] with an initial learning rate of 0.001 for 64 epochs. We employed a cosine annealing scheduler for learning rate decay. The number of layers for the agent-agent transformer, agent-lane transformer, temporal transformer, and global encoder was set to 1, 1, 4, and 3, respectively. The number of hidden units was 128, and all multi-head attention blocks used 8 heads. The local region radius was set to 20 m for A2A and 50 m for L2A. We did not predict agents that appeared for fewer than two timesteps, unless the agent was the target agent.
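The learning rate decay described above can be sketched with the standard cosine annealing formula. The minimum learning rate of zero and the per-epoch update granularity are assumptions; the text specifies only the initial rate and the number of epochs.

```python
import math

BASE_LR = 1e-3   # initial learning rate from the text
EPOCHS = 64      # number of training epochs from the text
MIN_LR = 0.0     # assumed floor; not stated in the text


def cosine_lr(epoch, base_lr=BASE_LR, total=EPOCHS, min_lr=MIN_LR):
    """Learning rate at the start of `epoch` under cosine annealing."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total))
    return min_lr + (base_lr - min_lr) * cosine


schedule = [cosine_lr(e) for e in range(EPOCHS)]
print(schedule[0], schedule[EPOCHS // 2], schedule[-1])
```

The rate starts at the initial value, reaches half of it at the midpoint of training, and approaches the floor in the final epochs.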

Comparison with State-of-the-Art
In Table 1, we present the results of QINET on the Argoverse motion prediction test set, comparing it with other state-of-the-art models. The data in Table 1 are sourced from the Argoverse leaderboard. QINET outperforms all other methods in terms of minADE and minFDE and maintains a competitive ranking in MR, verifying the predictive performance of our method. The sacrifice in the MR metric stems from the decoder's multimodal influence on the generated trajectories, but this influence improves the accuracy of trajectory prediction in some scenarios.
On the validation set, we compared our results with HiVT. Without model ensembling, our model outperforms HiVT on the Argoverse validation set; Table 2 reports the specific metrics.
Our ablation study examines the importance of each module in QINET, the importance of the expanded scene representation, and the importance of layered lane transformers. We conducted these experiments on the Argoverse validation set.
Importance of Each Module.
To investigate the importance of each module for the overall network, we removed each module individually and measured its contribution on the Argoverse validation set, as shown in Table 3. Each module contributes to the improvement of network performance. Firstly, without A2A, the model lacks local interactions within the prediction scenes, resulting in a decrease in model metrics.
Secondly, the absence of the temporal dependency module prevents the network from addressing temporal dependencies. Since inferring future trajectories of agents in highly dynamic environments heavily relies on historical information, the lack of the transformer encoder module significantly impairs the model's performance metrics.
Thirdly, lane information plays a crucial role in motion prediction, as road environment information constrains the trajectories of vehicles to some extent. Under such constraints, vehicles generally move along the lanes. Moreover, global graph A2A also contributes to the model's effectiveness, as global interactions can capture long-range dependency relationships, enhancing the accuracy of predictions.

Qualitative Results of QINET
In the visualizations in Figure 4, we selected representative scenes to demonstrate the qualitative results of the QINET network. The visualizations confirm that QINET is capable of performing multimodal predictions for all agents, and the predicted trajectories are reasonable and close to the ground truth. For clarity, we display only the agent's historical trajectory in yellow, the ground truth future trajectory in red, and the predicted trajectory in green. It can be observed that, owing to the local A2A, temporal, L2A, and global A2A modules, our network effectively extracts agent features and predicts their future trajectories. Additionally, the query-informed multimodal scene encoding effectively promotes the multimodality of the agents' future trajectories.

✓ ✓ ✓
0.6514 0.9481 0.0891

Comparison with HiVT in Bad Cases
In this part, we compare the outputs of QINET with those of HiVT, as shown in Figure 5. Some cases that HiVT predicts poorly are predicted better by QINET. In addition, in some intersection scenarios, the QINET network demonstrates awareness of turning. This is attributed to our decoder design, which enhances the multimodality of the trajectory predictions.

Failed Cases
In this section, we present some scenarios where QINET predictions failed, as shown in Figure 6. Compared to HiVT, QINET's prediction results have improved, with trajectories becoming more multimodal. This improvement is due to the presence of multimodal scene query features in the decoder.

Conclusions
This paper presents a new multi-agent prediction framework that enhances trajectory prediction accuracy by constructing extended node features and edge features. It uses the query mechanism in cross-attention to obtain multi-scenario modal encodings, thereby promoting the multimodality of the generated trajectories. Experiments demonstrate that our method achieves good results in both prediction accuracy and the multimodality of generated trajectories on the Argoverse motion prediction benchmark. Future research will focus on more efficient lane-to-agent (L2A) processing to improve the model's inference speed, because considering all lane nodes within a fixed radius can waste resources in some scenarios. For example, when a vehicle is predicted to drive in the far-left lane, L2A would include nodes from the opposite lane, which may not be meaningful.

Figure 2. Overview of the local encoder diagram: the query always originates from the central agent node feature. For A2A, we extract key and value from neighboring agent nodes. Self-attention is employed in the temporal encoder. For L2A, key and value are sourced from neighboring lane nodes. In the diagram, N represents the number of agents in the scene, T denotes the number of timesteps, and L represents the number of lane nodes.

Figure 3. Construction diagram of multi-scene modality. We establish a learnable scene query matrix to query the historical trajectory features of agents, obtaining modality features for multiple scenes for decoding future trajectories.

Figure 4. Qualitative results of QINET. We selected several classical scenario prediction results, shown in (a-d). For clarity, we visualize individual agents separately. We use orange to depict past trajectories, red for actual trajectories, and green for predicted trajectories.

For the edge attributes between agent nodes and lane nodes, we describe them as follows: $e_{al} = \{ R_i^{\top} ( d^{t}_{ik}, d^{t}_{i2k}, v^{t}_{i2k}, a^{t}_{i2k} ) \mid t = 0, \ldots, 19;\ i = 1, \ldots, N_t;\ k = 1, \ldots, N_l \}$; the details are as follows:

Table 2. Comparison of QINET with HiVT on the Argoverse validation set without model ensembling.

Table 3. Importance of each component of our framework.